Refining Duplicate Detection for Improved Data Quality

Authors

  • Yu Huang
  • Fei Chiang
Abstract

Detecting duplicates is a pervasive data quality challenge that hinders organizations from extracting value from their data sooner. The increased complexity and heterogeneity of modern datasets have led to varying record formats, missing values, and evolving data semantics. As data is integrated, duplicates inevitably occur in the integrated instance. One of the challenges in deduplication is determining whether two values are sufficiently close to be considered equal. Existing similarity functions often rely on counting the number of edits required to transform one value into the other. This is insufficient in attribute domains, such as time, where small syntactic differences do not always translate to "closeness". In this paper, we propose a duplicate detection framework that adapts metric functional dependencies (MFDs) to improve detection accuracy by relaxing the matching condition on numeric values to allow a permitted tolerance. We evaluate our techniques against two existing approaches using three real data collections, and show that we achieve an average 25% and 34% improvement in precision and recall, respectively, over non-MFD versions.
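The contrast the abstract draws between edit-based similarity and a tolerance on numeric values can be illustrated with a minimal sketch. This is not the authors' framework: the record layout, the `event` and `start` field names, the helper functions, and the 5-minute tolerance are all hypothetical, and the matching rule is only a simplified reading of a metric functional dependency (records that agree on one attribute should have values of another attribute within a distance `delta`).

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def to_minutes(hhmm: str) -> int:
    """Convert an 'HH:MM' string to minutes since midnight."""
    hours, mins = hhmm.split(":")
    return int(hours) * 60 + int(mins)


def within_tolerance(t1: str, t2: str, delta: int) -> bool:
    """Metric comparison: the two times differ by at most `delta` minutes."""
    return abs(to_minutes(t1) - to_minutes(t2)) <= delta


def is_duplicate_candidate(r1: dict, r2: dict, delta: int = 5) -> bool:
    """Hypothetical rule: same event, start times within the tolerance."""
    return (r1["event"] == r2["event"]
            and within_tolerance(r1["start"], r2["start"], delta))


a = {"event": "standup", "start": "09:59"}
b = {"event": "standup", "start": "10:01"}
c = {"event": "standup", "start": "16:00"}
d = {"event": "standup", "start": "10:00"}

# Syntactically far apart, but only two minutes apart on the clock.
print(edit_distance(a["start"], b["start"]), is_duplicate_candidate(a, b))
# A single character differs, but the times are six hours apart.
print(edit_distance(d["start"], c["start"]), is_duplicate_candidate(d, c))
```

In this toy data, the edit-based measure ranks the pair that is six hours apart as closer than the pair that is two minutes apart, while the tolerance check does the opposite, which is the mismatch the abstract describes for the time domain.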


Similar resources

A New Method for Duplicate Detection Using Hierarchical Clustering of Records

Accuracy and validity of data are prerequisites for the proper operation of any software system. Errors can always occur in data due to human and system faults. One such error is the existence of duplicate records in data sources. Duplicate records refer to the same real-world entity. There should be only one of them in a data source, but for some reasons like aggregation of ...


A Novel Framework and Model for Data

Data cleansing is a process that deals with identification of corrupt and duplicate data inherent in the data sets of a data warehouse to enhance the quality of data. This paper aims to facilitate the data cleaning process by addressing the problem of duplicate records detection pertaining to the "name" attributes of the data sets. It provides a sequence of algorithms through a novel framework ...


Using Trainable Duplicate Detection for Automated Public Data Refining

Public institutions share important data on the Web. These data are essential for public investigation and thus increase transparency. However, it is difficult to process them, since there are numerous mistypings, disambiguities and duplicates. In this paper we propose an automated approach for cleaning of these data, so that further querying result is reliable. We develop a duplicate detection...


Semantic Similarity Match for Data Quality

Data quality is a critical aspect of applications that support business operations. Often entities are represented more than once in data repositories. Since duplicate records do not share a common key, they are hard to detect. Duplicate detection over text is usually performed using lexical approaches, which do not capture text sense. The difficulties increase when the duplicate detection must...


An Efficient Approach for Near-duplicate page detection in web crawling

The drastic development of the World Wide Web in the recent times has made the concept of Web Crawling receive remarkable significance. The voluminous amounts of web documents swarming the web have posed huge challenges to the web search engines making their results less relevant to the users. The presence of duplicate and near duplicate web documents in abundance has created additional overhea...




Publication date: 2017